Introduction

In this notebook we'll analyze some of Joyce's wordplay in Ulysses, using more complicated regular expressions.

Tokenizing Without Punctuation

To tokenize the chapter and throw out the punctuation, we can use the regular expression \w+. Note that this will split up contractions like "can't" into ["can","t"].


In [173]:
%matplotlib inline

import nltk, re, io

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pylab import *

In [174]:
txtfile = 'txt/08lestrygonians.txt'

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
with io.open(txtfile) as f:
    tokens = tokenizer.tokenize(f.read())
print tokens[1000:1020]


[u'really', u'It', u's', u'always', u'flowing', u'in', u'a', u'stream', u'never', u'the', u'same', u'which', u'in', u'the', u'stream', u'of', u'life', u'we', u'trace', u'Because']

In [175]:
print tokenizer.tokenize("can't keep a contraction together!")


['can', 't', 'keep', 'a', 'contraction', 'together']

Method 1: TokenSearcher Object

The first method for searching for regular expressions in a set of tokens is the TokenSearcher object. This can be fed a regular expression that searches across tokens, and it will search through each token. This provides a big advantage: we don't have to manually break all of our tokens into n-grams ourselves, we can just let the TokenSearcher do the hard work.

Here's an example of how to create and call that object:


In [214]:
tsearch = nltk.TokenSearcher(tokens)
s_s_ = tsearch.findall(r'<s.*> <.*> <s.*> <.*> <.*>')
print len(s_s_)
for s in s_s_:
    print ' '.join(s)


78
scotch A sugarsticky girl shovelling
selling off some old furniture
saw flapping strongly wheeling between
seabirds gulls seagoose Swans from
sound She s not exactly
sandwichmen marched slowly towards him
street after street Just keep
smart girls sitting inside writing
suited her small head Sister
she If she had married
some sticky stuff Flies picnic
she looked soaped all over
saint Kevin s parade Pen
s womaneyes said melancholily Now
said He s a caution
said He s always bad
speak Look straight in her
serge dress she had two
sugary flour stuck to her
simply Child s head too
something to stop that Life
she had so many children
said The spoon of pap
still All skedaddled Why he
squad Turnkey s daughter got
sun slowly shadowing Trinity s
say Other steps into his
spewed Provost s house The
s uniform since he got
say it s healthier Windandwatery
sweating Irish stew into their
s daughter s bag and
some king s mistress His
s bottle shoulders On his
street west something changed Could
s corner still pursued Jingling
shovelled gurgling soup down his
stewgravy with sopping sippets of
server gathered sticky clattering plates
second helping stared towards the
split their skulls open Moo
sheepsnouts bloodypapered snivelling nosejam on
smokinghot thick sugary Famished ghosts
something the somethings of the
some fellow s digestion Religions
sandwich Yes sir Like a
s the style Who s
sandwich into slender strips Mr
see Part shares and part
strongly to speed it set
said He s the organiser
snuffled and scratched Flea having
such and such replete Too
strips of sandwich fresh clean
s no straight sport going
soaked and softened rolled pith
sturgeon high sheriff Coffey the
soup Geese stuffed silly for
s the same fish perhaps
sky No sound The sky
see Never speaking I mean
something fall see if she
said They stick to you
said He s a safe
say He s not too
sake What s yours Tom
said Certainly sir Paddy Leonard
said with scorn Mr Byrne
said A suckingbottle for the
sweet then savoury Mr Bloom
s confectioner s window of
said Molesworth street is opposite
street different smell Each person
spring the summer smells Tastes
shameless not seeing That girl
school I sentenced him to
sunlight Tan shoes Turnedup trousers
stuck Ah soap there I

Method 2: Bigram Splitting Method

Another way of searching for patterns, one that may be needed if we want to use criteria that would be hard to implement with a regular expression (such as finding two words that are the same length next to each other), is to assemble all of the tokens into bigrams.

Suppose we are looking for two words that start with the same letter. We can do this by iterating through a set of bigrams (we'll use a built-in NLTK object to generate bigrams), and apply our search criteria to the first and second words independently.

To create bigrams, we'll use the nltk.bigrams() method, feeding it a list of tokens.

When we do this, we can see there's a lot of alliteration in this chapter.


In [177]:
def printlist(the_list):
    for item in the_list:
        print item

In [178]:
alliteration = []
for (i,j) in nltk.bigrams(tokens):
    if i[:1]==j[:1]:
        alliteration.append( ' '.join([i,j]) )
        

print "Found",len(alliteration),"pairs of words starting with the same letter:"
printlist(alliteration[:10])
printlist(alliteration[-10:])


Found 551 pairs of words starting with the same letter:
shovelling scoopfuls
their tummies
riverward reading
wife will
to the
in it
he had
their theology
out of
themselves to
funds for
to the
it is
short sighs
cream curves
hasty hand
Agendath Afternoon
she said
Potato Purse
his hip

In [179]:
lolly = []
for (i,j) in nltk.bigrams(tokens):
    if len( re.findall('ll',i) )>0:
        if len( re.findall('l',j) )>0:
            lolly.append( ' '.join([i,j]) )
    elif len( re.findall('ll',j) )>0:
        if len( re.findall('l',i) )>0:
            lolly.append(' '.join([i,j]) )

print "Found",len(lolly),"pairs of words, one containing 'll' and the other containing 'l':"
print "First 25:"
printlist(lolly[:25])


Found 107 pairs of words, one containing 'll' and the other containing 'l':
First 25:
girl shovelling
shovelling scoopfuls
All heartily
like all
himself well
collie floating
quaywalls gulls
ball Elijah
swells floated
gull Flaps
treacly swells
swells lazily
parallel parallax
all Only
black celluloid
envelopes Hello
Kansell sold
Phil Gilligan
Val Dillon
flag fell
wallpaper Dockrell
probably Well
blizzard collar
gaily Milly
medicinebottle Pastille

In [180]:
lolly = []
for (i,j) in nltk.bigrams(tokens):
    if len( re.findall('rr',i) )>0:
        if len( re.findall('r',j) )>0:
            lolly.append( ' '.join([i,j]) )
    elif len( re.findall('rr',j) )>0:
        if len( re.findall('r',i) )>0:
            lolly.append(' '.join([i,j]) )

print "Found",len(lolly),"pairs of words, one containing 'r' and the other containing 'r':"
printlist(lolly)


Found 22 pairs of words, one containing 'r' and the other containing 'r':
daguerreotype atelier
supperroom or
from Harrison
terrible for
Farrell Mr
Weightcarrying huntress
or fivebarred
Dr Murren
marching irregularly
irregularly rounded
suburbs jerrybuilt
jerrybuilt Kerwan
garden Terrific
Portobello barracks
artificial irrigation
irrigation Bleibtreustrasse
dropping currants
currants Screened
whispered Prrwht
ravenous terrier
Earlsfort terrace
Where Hurry

Functionalizing Bigram Searches

We can functionalize the search for patterns with a single and double character shared, i.e., dropping currants (the letter r).


In [181]:
def double_letter_alliteration(c,tokens):
    """
    This function finds all occurrences of double-letter and single-letter 
    occurrences of the character c.
    
    This function is called by all_double_letter_alliteration().
    """
    allall  = []
    for (i,j) in nltk.bigrams(tokens):
        if len( re.findall(c+c,i) )>0:
            if len( re.findall(c,j) )>0:
                lolly.append( ' '.join([i,j]) )
        elif len( re.findall(c+c,j) )>0:
            if len( re.findall(c,i) )>0:
                allall.append(' '.join([i,j]) )
    return allall

Now we can use this function to search for the single-double letter pattern individually, or we can define a function that will loop over all 26 letters to find all matching patterns.


In [182]:
printlist(double_letter_alliteration('r',tokens))


from Harrison
or fivebarred
Dr Murren
marching irregularly
suburbs jerrybuilt
garden Terrific
Portobello barracks
artificial irrigation
dropping currants
whispered Prrwht
ravenous terrier
Earlsfort terrace
Where Hurry

In [183]:
printlist(double_letter_alliteration('o',tokens))


shovelling scoopfuls
Some school
No Blood
word Good
story too
lower Looking
Those poor
Trousers Good
to loosen
Women too
People looking
photography Poor
or oakroom
Professor Goodwin
old Goodwin
now Cook
Pothunters too
Tommy Moore
mortarboards Looking
corporation too
of goosegrease
two loonies
of food
Women too
money too
from School
you poor
Molly looks
of bloodhued
lustrous blood
sloppy food
Working tooth
to look
own tooth
onions mushrooms
open Moo
sheepsnouts bloodypapered
Hello Bloom
missionary too
olives too
Not logwood
of wood
some good
more Fool
for food
of Moore
wonder Coolsoft
gods food
not too
down too
some bloody
of poor
Horse drooping
stronger too
person too
Not smooth
to Poor
bluecoat school
pocket took

In [184]:
import string

def all_double_letter_alliteration(tokens):
    all_all = []
    alphabet = list(string.ascii_lowercase)
    for aleph in alphabet:
        results = double_letter_alliteration(aleph,tokens) 
        print "Matching",aleph,":",len(results)
        all_all += results
    return all_all

In [185]:
allall = all_double_letter_alliteration(tokens)
print len(allall)


Matching a : 1
Matching b : 3
Matching c : 4
Matching d : 8
Matching e : 109
Matching f : 1
Matching g : 5
Matching h : 1
Matching i : 0
Matching j : 0
Matching k : 0
Matching l : 47
Matching m : 1
Matching n : 16
Matching o : 59
Matching p : 1
Matching q : 0
Matching r : 13
Matching s : 31
Matching t : 38
Matching u : 0
Matching v : 0
Matching w : 0
Matching x : 0
Matching y : 0
Matching z : 0
338

That's a mouthful of alliteration! We can compare the number of words that matched this (one, single) search for examples of alliteration to the total number of words in the chapter:


In [186]:
double(len(allall))/len(tokens)


Out[186]:
0.026187340202990624

Holy cow - 2.6% of the chapter is just this one alliteration pattern, of having two neighbor words: one with a double letter, and one with a single letter.


In [216]:
print len(allall)
printlist(allall[:20])


338
bawling maaaaaa
ball bobbed
bob Bubble
buckets wobbly
collecting accounts
Scotch accent
Scotch accent
crown Accept
had plodded
dumdum Diddlediddle
and bidding
remembered Hidden
naked goddesses
said Paddy
standing Paddy
Rochford nodded
bluey greeny
goes Fifteen
they feel
They wheeled

Let's look at the pattern taken one step further: we'll look for double letters in neighbor words.


In [218]:
def match_double(aleph,tokens):
    matches = []
    for (i,j) in nltk.bigrams(tokens):
        if len( re.findall(aleph+aleph,i) )>0:
            if len( re.findall(aleph+aleph,j) )>0:
                matches.append(' '.join([i,j]))
    return matches

def double_double(tokens):
    dd = []
    alphabet = list(string.ascii_lowercase)
    for aleph in alphabet:
        results = match_double(aleph, tokens)
        print "Matching %s%s: %d"%(aleph,aleph,len(results))
        dd += results
    return dd

print "Neighbor words with double letters:"
dd = double_double(tokens)
printlist(dd)


Neighbor words with double letters:
Matching aa: 0
Matching bb: 0
Matching cc: 0
Matching dd: 0
Matching ee: 5
Matching ff: 2
Matching gg: 0
Matching hh: 0
Matching ii: 0
Matching jj: 0
Matching kk: 0
Matching ll: 15
Matching mm: 0
Matching nn: 2
Matching oo: 4
Matching pp: 3
Matching qq: 0
Matching rr: 0
Matching ss: 1
Matching tt: 1
Matching uu: 0
Matching vv: 0
Matching ww: 0
Matching xx: 0
Matching yy: 0
Matching zz: 0
wheeling between
Fleet street
Three cheers
greens See
green cheese
scruff off
sheriff Coffey
quaywalls gulls
parallel parallax
wallpaper Dockrell
Tisdall Farrell
belly swollen
still All
Silly billies
ll tell
swollen belly
Wellmannered fellow
ball falls
full All
Kill Kill
numbskull Will
William Miller
Penny dinner
canny Cunning
looks too
Goosestep Foodheated
loonies mooching
Moo Poor
Happy Happier
Happy Happy
sopping sippets
pressed grass
platt butter

Acronyms

Let's take a look at some acronyms. For this application, it might be better to tokenize by sentence, and extract acronyms for sentences.


In [219]:
with io.open(txtfile) as f:
    sentences = nltk.sent_tokenize(f.read())
print len(sentences)


1979

In [220]:
acronyms = []
for s in sentences:
    s2 = re.sub('\n',' ',s)
    words = s2.split(" ")
    acronym = ''.join(w[0] for w in words if w<>u'')
    acronyms.append(acronym)
            
print len(acronyms)
print "-"*20
printlist(acronyms[:10])
print "-"*20
printlist(sentences[:10]) # <-- contains newlines, but removed to create acronyms


1979
--------------------
Prlpbs
Asgssocfacb
Sst
Bftt
LacmtHMtK
G
S
O
Sohtsrjw
AsY
--------------------
Pineapple rock, lemon platt, butter scotch.
A sugarsticky girl
shovelling scoopfuls of creams for a christian brother.
Some school
treat.
Bad for their tummies.
Lozenge and comfit manufacturer to His
Majesty the King.
God.
Save.
Our.
Sitting on his throne sucking red
jujubes white.
A sombre Y.M.C.A.

In [221]:
from nltk.corpus import words

In [222]:
acronyms[101:111]


Out[222]:
[u'Khttast',
 u'Twl',
 u'Lfg',
 u'W',
 u'Htdatacpb',
 u'Etfpsic',
 u'Nab',
 u'Tbbuotwosfubtb',
 u'Nsdf',
 u'AtdIttscootEKpiuitwfya']

In [ ]: